
    Revisiting Guerry's data: Introducing spatial constraints in multivariate analysis

    Standard multivariate analysis methods aim to identify and summarize the main structures in large data sets containing the description of a number of observations by several variables. In many cases, spatial information is also available for each observation, so that a map can be associated with the multivariate data set. Two main objectives are relevant in the analysis of spatial multivariate data: summarizing covariation structures and identifying spatial patterns. In practice, achieving both goals simultaneously is a statistical challenge, and a range of methods have been developed that offer trade-offs between these two objectives. In an applied context, this methodological question has been and remains a major issue in community ecology, where species assemblages (i.e., covariation between species abundances) are often driven by spatial processes (and thus exhibit spatial patterns). In this paper we review a variety of methods developed in community ecology to investigate multivariate spatial patterns. We present different ways of incorporating spatial constraints in multivariate analysis and illustrate these different approaches using the famous data set on moral statistics in France published by André-Michel Guerry in 1833. We discuss and compare the properties of these different approaches from both a practical and a theoretical viewpoint. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/10-AOAS356 by the Institute of Mathematical Statistics (http://www.imstat.org

    Discriminant analysis of principal components: a new method for the analysis of genetically structured populations

    Background: The dramatic progress in sequencing technologies offers unprecedented prospects for deciphering the organization of natural populations in space and time. However, the size of the datasets generated also poses some daunting challenges. In particular, Bayesian clustering algorithms based on pre-defined population genetics models, such as the STRUCTURE or BAPS software, may not be able to cope with this unprecedented amount of data. Thus, there is a need for less computer-intensive approaches. Multivariate analyses seem particularly appealing as they are specifically devoted to extracting information from large datasets. Unfortunately, currently available multivariate methods still lack some essential features needed to study the genetic structure of natural populations.
    Results: We introduce the Discriminant Analysis of Principal Components (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals. When group priors are lacking, DAPC uses sequential K-means and model selection to infer genetic clusters. Our approach extracts rich information from genetic data, providing assignment of individuals to groups, a visual assessment of between-population differentiation, and the contribution of individual alleles to population structuring. We evaluate the performance of our method using simulated data, which were also analyzed using STRUCTURE as a benchmark. Additionally, we illustrate the method by analyzing microsatellite polymorphism in worldwide human populations and hemagglutinin gene sequence variation in seasonal influenza.
    Conclusions: Analysis of simulated data revealed that our approach generally performs better than STRUCTURE at characterizing population subdivision. The tools implemented in DAPC for the identification of clusters and the graphical representation of between-group structures make it possible to unravel complex population structures. Our approach is also faster than Bayesian clustering algorithms by several orders of magnitude, and may be applicable to a wider range of datasets.

    Spatiotemporal dynamics in the early stages of the 2009 A/H1N1 influenza pandemic.

    Epidemiology and public health planning will increasingly rely on the analysis of genetic sequence data. The ongoing influenza A/H1N1 pandemic may represent a tipping point in this trend, with A/H1N1 being the first human pathogen routinely genotyped from the beginning of its spread. To take full advantage of this genetic information, we introduce a novel method to reconstruct the spatiotemporal dynamics of outbreaks from sequence data. The approach is based on a new paradigm where ancestries are inferred directly rather than through the reconstruction of most recent common ancestors (MRCAs) as in phylogenetics. Using 279 A/H1N1 hemagglutinin (HA) sequences, we confirm the emergence of the 2009 flu pandemic in Mexico. The virus initially spread to the US, and then to the rest of the world, with both Mexico and the US acting as the main sources. While compatible with current epidemiological understanding of the 2009 H1N1 pandemic, our results provide a much finer picture of the spatiotemporal dynamics. The results also highlight how much additional epidemiological information can be gathered from genetic monitoring of a disease outbreak.

    Consensus genetic structuring and typological value of markers using multiple co-inertia analysis

    Working with weakly congruent markers means that consensus genetic structuring of populations requires methods explicitly devoted to this purpose. The method presented here belongs to the family of multivariate analyses and consists of several steps. First, single-marker analyses are performed using a version of principal component analysis designed for allelic frequencies (%PCA). Drawing confidence ellipses around the population positions enhances %PCA plots. Second, a multiple co-inertia analysis (MCOA) is performed, which reveals the common features of the single-marker analyses, builds a reference structure and makes it possible to compare single-marker structures with this reference through graphical tools. Finally, a typological value is provided for each marker. The typological value measures how efficiently a marker structures populations in the same way as other markers. In this study, we evaluate the value and efficiency of this method applied to a European and African bovine microsatellite data set. The typological value differs among markers, indicating that some markers are more efficient in displaying a consensus typology than others. Moreover, markers that are efficient in one collection of populations do not remain efficient in others. The number of markers used in a study is not a sufficient criterion to judge its reliability. "Quantity is not quality".

    EpiJSON: A unified data-format for epidemiology

    Epidemiology relies on data, but the divergent ways data are recorded and transferred, both within and between outbreaks, and the expanding range of data types are creating an increasingly complex problem for the discipline. There is a need for a consistent, interpretable and precise way to transfer data while maintaining its fidelity. We introduce ‘EpiJSON’, a new, flexible, and standards-compliant format for the interchange of epidemiological data using JavaScript Object Notation. This format is designed to enable the widest range of epidemiological data to be unambiguously held and transferred between people, software and institutions. In this paper, we provide a full description of the format and a discussion of the design decisions made. We introduce a schema enabling automatic checks of the validity of data stored as EpiJSON, which can serve as a basis for the development of additional tools. In addition, we present the R package ‘repijson’, which provides conversion tools between this format, line-list data and pre-existing analysis tools. An example is given to illustrate how EpiJSON can be used to store line-list data. EpiJSON, designed around modern standards for interchange of information on the internet, is simple to implement, read and check. As such, it provides an ideal new standard for epidemiological, and other, data transfer to the fast-growing open-source platform for the analysis of disease outbreaks.

    Climate shaped the worldwide distribution of human mitochondrial DNA sequence variation

    There is an ongoing discussion in the literature on whether human mitochondrial DNA (mtDNA) evolves neutrally. There have been previous claims for natural selection on human mtDNA based on an excess of non-synonymous mutations and higher evolutionary persistence of specific mitochondrial mutations in Arctic populations. However, these findings were not supported by the reanalysis of larger datasets. Using a geographical framework, we perform the first direct test of the relative extent to which climate and past demography have shaped the current spatial distribution of mtDNA sequences worldwide. We show that populations living in colder environments have lower mitochondrial diversity and that the genetic differentiation between pairs of populations correlates with difference in temperature. These associations were unique to mtDNA; we could not find a similar pattern in any other genetic marker. We were able to identify two correlated non-synonymous point mutations in the ND3 and ATP6 genes characterized by a clear association with temperature, which appear to be plausible targets of natural selection producing the association with climate. The same mutations have been previously shown to be associated with variation in mitochondrial pH and calcium dynamics. Our results indicate that natural selection mediated by climate has contributed to shaping the current distribution of mtDNA sequences in humans.

    Bayesian inference of transmission chains using timing of symptoms, pathogen genomes and contact data.

    There exists significant interest in developing statistical and computational tools for inferring 'who infected whom' in an infectious disease outbreak from densely sampled case data, with most recent studies focusing on the analysis of whole genome sequence data. However, genomic data can be poorly informative of transmission events if mutations accumulate too slowly to resolve individual transmission pairs or if there exist multiple pathogen lineages within-host, and there has been little focus on incorporating other types of outbreak data. We present here a methodology that uses contact data for the inference of transmission trees in a statistically rigorous manner, alongside genomic data and temporal data. Contact data is frequently collected in outbreaks of pathogens spread by close contact, including Ebola virus (EBOV), severe acute respiratory syndrome coronavirus (SARS-CoV) and Mycobacterium tuberculosis (TB), and routinely used to reconstruct transmission chains. As an improvement over previous, ad-hoc approaches, we developed a probabilistic model that relates a set of contact data to an underlying transmission tree and integrated this in the outbreaker2 inference framework. By analyzing simulated outbreaks under various contact tracing scenarios, we demonstrate that contact data significantly improves our ability to reconstruct transmission trees, even under realistic limitations on the coverage of the contact tracing effort and the amount of non-infectious mixing between cases. Indeed, contact data is equally or more informative than fully sampled whole genome sequence data in certain scenarios. We then use our method to analyze the early stages of the 2003 SARS outbreak in Singapore and describe the range of transmission scenarios consistent with contact data and genetic sequence in a probabilistic manner for the first time. This simple yet flexible model can easily be incorporated into existing tools for outbreak reconstruction and should permit a better integration of genomic and epidemiological data for inferring transmission chains.

    Modelling that shaped the early COVID-19 pandemic response in the UK.

    Infectious disease modelling has been an integral part of the scientific evidence used to guide the response to the COVID-19 pandemic. In the UK, modelling evidence used for policy is reported to the Scientific Advisory Group for Emergencies (SAGE) modelling subgroup, SPI-M-O (Scientific Pandemic Influenza Group on Modelling-Operational). This Special Issue contains 20 articles detailing evidence that underpinned advice to the UK government during the SARS-CoV-2 pandemic in the UK between January 2020 and July 2020. Here, we introduce the UK scientific advisory system and how it operates in practice, and discuss how infectious disease modelling can be useful in policy making. We examine the drawbacks of current publishing practices and academic credit and highlight the importance of transparency and reproducibility during an epidemic emergency. This article is part of the theme issue 'Modelling that shaped the early COVID-19 pandemic response in the UK'.